Red Wine

Introduction

We will investigate the Red Wine dataset of physicochemical properties and quality ratings. It contains 1,599 red wine samples from the north of Portugal. Each sample comes with a quality score between 0 and 10 and the results of several physicochemical tests, such as alcohol content, acidity level and residual sugar. There are 11 columns describing chemical properties and one column for the quality rating.

Description of attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at levels that are too high can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

In this project, our main goal is to discover which chemical properties influence the quality of red wines and to understand how they do so. After that, we will create a model to predict the quality of wine.

Wrangle data

Cleaning

Since the text file about the dataset mentioned that there are no missing values, we will check only for duplicated values.

## [1] 0

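Note: There are no duplicated rows. The check appears to cover every column, including the index column X, which is unique for each row; rows 1 and 5 in the preview printed further below are identical in all other columns.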
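The report's own code is not shown, so the following is only a small Python sketch of a duplicate-row check, not the author's implementation. It uses the six rows displayed in the preview later in this report (values as printed there), and the helper name `count_duplicates` is hypothetical. It illustrates how including a unique index column such as X changes the result.

```python
# First six rows of the dataset, copied from the preview printed later in
# this report. Column order: X, fixed.acidity, volatile.acidity, citric.acid,
# residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide,
# density, pH, sulphates, alcohol, quality.
ROWS = [
    (1, 7.4, 0.70, 0.00, 1.9, 0.076, 11, 34, 0.9978, 3.51, 0.56, 9.4, 5),
    (2, 7.8, 0.88, 0.00, 2.6, 0.098, 25, 67, 0.9968, 3.20, 0.68, 9.8, 5),
    (3, 7.8, 0.76, 0.04, 2.3, 0.092, 15, 54, 0.9970, 3.26, 0.65, 9.8, 5),
    (4, 11.2, 0.28, 0.56, 1.9, 0.075, 17, 60, 0.9980, 3.16, 0.58, 9.8, 6),
    (5, 7.4, 0.70, 0.00, 1.9, 0.076, 11, 34, 0.9978, 3.51, 0.56, 9.4, 5),
    (6, 7.4, 0.66, 0.00, 1.8, 0.075, 13, 40, 0.9978, 3.51, 0.56, 9.4, 5),
]


def count_duplicates(rows, skip_first_column=False):
    """Count rows that repeat an earlier row (hypothetical helper)."""
    seen = set()
    duplicates = 0
    for row in rows:
        # Optionally ignore the first column (the index X) when comparing.
        key = row[1:] if skip_first_column else row
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


with_index = count_duplicates(ROWS)
without_index = count_duplicates(ROWS, skip_first_column=True)
print(with_index, without_index)
```

With the index column included, every row is unique by construction, so it may be worth repeating the check on the measurement columns only.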

Evaluate the shape, structure and summary of the dataset (respectively):

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Notes: Our dataset consists of 13 variables (the index column X, 11 physicochemical properties and the quality score), with 1,599 observations.

Univariate Analysis

Quality

Notes: Most wine samples have a score of 5 or 6 (almost 80% of the dataset). Only a few wines received the highest score (8), and the same holds for the lowest scores (3, 4). About 200 wines have a score of 7. Next, I will compare the mean and median of the physicochemical properties for the lowest (3), average (5 and 6) and highest (8) quality scores to understand the main differences among them. In the mean and median tables below, the average group is listed as quality 5.

Mean and median analysis for the highest (8), the average (5, 6) and the lowest (3) quality scores

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

Mean

##   quality fixed.acidity volatile.acidity citric.acid residual.sugar
## 1       3      8.360000        0.8845000   0.1710000       2.635000
## 2       5      8.254284        0.5385595   0.2582638       2.503867
## 3       8      8.566667        0.4233333   0.3911111       2.577778
##    chlorides free.sulfur.dioxide total.sulfur.dioxide   density       pH
## 1 0.12250000            11.00000             24.90000 0.9974640 3.398000
## 2 0.08897271            16.36846             48.94693 0.9968673 3.311296
## 3 0.06844444            13.27778             33.44444 0.9952122 3.267222
##   sulphates  alcohol
## 1 0.5700000  9.95500
## 2 0.6472631 10.25272
## 3 0.7677778 12.09444

Median

##   quality fixed.acidity volatile.acidity citric.acid residual.sugar
## 1       3          7.50            0.845       0.035            2.1
## 2       5          7.80            0.540       0.240            2.2
## 3       8          8.25            0.370       0.420            2.1
##   chlorides free.sulfur.dioxide total.sulfur.dioxide  density   pH
## 1    0.0905                 6.0                 15.0 0.997565 3.39
## 2    0.0800                14.0                 40.0 0.996800 3.31
## 3    0.0705                 7.5                 21.5 0.994940 3.23
##   sulphates alcohol
## 1     0.545   9.925
## 2     0.610  10.000
## 3     0.740  12.150

Notes: The mean and the median differ noticeably for the sulfur dioxide variables (free and total), the acidity variables (fixed, volatile and citric) and residual sugar. To limit the influence of outliers, we will consider only the median. Later, we will investigate all variables with histograms and box plots.
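As a small illustration of why the median is less sensitive to outliers than the mean, the Python sketch below uses the first ten residual sugar values shown in the structure output above. It is an illustration on ten values only, not a result for the full dataset.

```python
from statistics import mean, median

# First ten residual.sugar values, copied from the structure output above.
residual_sugar = [1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1]

# Same values with the single large reading (6.1) removed.
without_large_value = [v for v in residual_sugar if v < 6.0]

mean_all = mean(residual_sugar)
median_all = median(residual_sugar)
mean_trimmed = mean(without_large_value)
median_trimmed = median(without_large_value)

# The mean moves when the large value is dropped; the median does not.
print(f"mean:   {mean_all:.3f} -> {mean_trimmed:.3f}")
print(f"median: {median_all:.3f} -> {median_trimmed:.3f}")
```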

The wine samples with the highest score have the lowest median density, volatile acidity, pH and sugar (the lowest score has the same median sugar). Furthermore, they have the highest median alcohol, fixed acidity, citric acid and sulphates, while the sulfur dioxides show no clear trend.

What attributes increase values with a better rating?

Alcohol, fixed acidity, citric acid, sulphates

What attributes decrease values with a better rating?

Density, volatile acidity, pH, and sugar

Fixed Acidity

Distribution: Right-skewed, there is a tail, but it is not a long one.

Outliers: The boxplot shows a few outliers from 12 to 16.

Volatile Acidity

Distribution: The first histogram appears bimodal, with peaks around 0.4 and 0.6, but on a log10 scale the distribution appears left-skewed with a long tail.

Outliers: There are a few outliers at the high end, from around 1.0 to 1.6.

Citric Acid

Distribution: The first histogram appears right-skewed, with a high peak around 0.50 and a short tail. On a log10 scale, the skew changes direction to the left and a long tail appears.

Outliers: There is one outlier.

Sugar

Notes: Since we could not clearly see the distribution of sugar because of the long tail, we will create a new histogram with breaks and limits.

Distribution: It has a right-skewed distribution with a long tail (from 4 to 16). The distribution with log10 appears right-skewed with a long tail to the right.

Outliers: There are many outliers with a large range (from 4 to 16).
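To make "right-skewed" and the effect of the log10 scale concrete, here is a minimal Python sketch that computes a moment-based skewness before and after a log10 transform. It uses only the first ten residual sugar values from the structure output above, so the numbers are illustrative and not those of the full dataset; `sample_skewness` is a hypothetical helper.

```python
import math


def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (positive means right-skewed)."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# First ten residual.sugar values, copied from the structure output above.
residual_sugar = [1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1]

skew_raw = sample_skewness(residual_sugar)
skew_log10 = sample_skewness([math.log10(v) for v in residual_sugar])
print(f"raw: {skew_raw:.2f}, log10: {skew_log10:.2f}")
```

On these ten values the log10 transform reduces the skewness but leaves it positive, which is in line with the note that sugar stays right-skewed on the log10 scale.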

Chlorides

Notes: Since we could not clearly see the distribution of chlorides because of the long tail, we will create a new histogram with breaks and limits.

Distribution: It is symmetrical with a long tail (from 0.2 to 0.6). On a log10 scale it also appears symmetrical, with a long tail to the right. After adjusting the breaks and limits, the distribution remains symmetrical.

Outliers: There are many upper outliers over an extensive range (around 0.1 to 0.6) and a few lower outliers (around 0.0 to 0.05).

Free Sulfur Dioxide

Distribution: The first histogram appears right-skewed, but when we zoom into the histogram with log10, it seems to be bimodal with peaks around 5 and 10. Further, there is also a short tail.

Outliers: There are a few outliers with a range around 40 to 60.

Total Sulfur Dioxide

Distribution: The first histogram appears right-skewed, but when we zoom into the histogram with log10, it seems to be symmetrical. Furthermore, there is a long tail.

Outliers: There are many outliers with a higher range, around 120 to 300.

Density

Distribution: The distribution of density is symmetrical for all plots.

Outliers: There are a few upper and lower outliers.

pH

Distribution: The distribution of pH is symmetrical for all plots.

Outliers: There are a few upper and lower outliers.

Sulphates

Distribution: It is right-skewed with a long tail.

Outliers: There are many outliers with a higher range, around 1.0 to 2.0.

Alcohol

Distribution: It is right-skewed with a short tail.

Outliers: There are a few outliers.

Univariate Analysis Summary

In this analysis, we investigated the distribution and outliers through a histogram, a histogram with a log10 scale, density chart, and box plot for all variables. I found three different shapes of data: symmetrical, right- and left-skewed. Furthermore, we discovered that some variables have many outliers, such as sugar and chlorides. After this, we will research in-depth and decide if we will remove or keep the outliers because this situation may or may not affect our analysis.
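The outlier ranges quoted above can be cross-checked against the quartiles in the summary output. The Python sketch below assumes the box plots use the common 1.5 x IQR whisker rule; that rule is an assumption (the plotting code is not shown), and R's box plot hinges can differ slightly from the quartiles printed in the summary.

```python
# Quartiles (1st Qu., 3rd Qu.) and maxima copied from the summary output above.
SUMMARY = {
    "fixed.acidity": {"q1": 7.10, "q3": 9.20, "max": 15.90},
    "residual.sugar": {"q1": 1.90, "q3": 2.60, "max": 15.50},
    "chlorides": {"q1": 0.07, "q3": 0.09, "max": 0.611},
}


def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences; points outside count as outliers.

    k=1.5 is the conventional multiplier and is an assumption here.
    """
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


fences = {name: outlier_fences(s["q1"], s["q3"]) for name, s in SUMMARY.items()}
for name, (lower, upper) in fences.items():
    print(f"{name}: lower={lower:.3f}, upper={upper:.3f}, max={SUMMARY[name]['max']}")
```

Under this assumption the upper fence for fixed acidity is about 12.35, which agrees with the outliers reported from 12 to 16, and the narrow IQR of sugar and chlorides explains why these two variables show so many outliers.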

What is the structure of your dataset?

        (low)---------->(high) 

quality: 3, 4, 5, 6, 7, 8

Other observations:

Almost 80% of wines have an average score.

Alcohol, fixed acidity, citric acid and sulphates increase with a better rating.

Density, volatile acidity, pH and sugar decrease with a better rating.

What is/are the main feature(s) of interest in your dataset?

Since we are interested in the output based on sensory data, the main features of interest are those that wine experts could perceive. These are: quality, alcohol, citric acid, density, sugar, total sulfur dioxide and volatile acidity.

Which variables have the distribution right-skewed?

Alcohol, Citric Acid, Free Sulfur Dioxide, Fixed Acidity, Sugar, Sulphates, Total Sulfur Dioxide.

Which variables have the distribution left-skewed?

Volatile Acidity(log10).

Which variables have the distribution symmetrical or bimodal?

Chlorides, Density, Volatile Acidity(bimodal), pH.

Which variables have none or a few outliers?

Fixed Acidity, Volatile Acidity, Citric Acid, Free Sulfur Dioxide, pH.

Which variables have many outliers?

Sugar, Chlorides, Total Sulfur Dioxide, Sulphates.

Bivariate Analysis

Correlation Matrix

Notes: Quality has a moderate positive correlation with alcohol (0.5) and a moderate negative correlation with volatile acidity (-0.4). Other correlations: density has a moderate negative correlation with alcohol (-0.5), a strong positive correlation with fixed acidity, and weaker positive correlations with residual sugar and citric acid. The pH variable has a strong negative correlation with citric acid and fixed acidity. Moreover, total sulfur dioxide has a strong positive correlation with free sulfur dioxide (0.7). Finally, citric acid has a strong positive correlation with fixed acidity (0.7) and a negative correlation with volatile acidity (-0.6). Let’s check these correlations with the correlation network analysis below.
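For reference, a correlation coefficient of the kind quoted above can be computed as in the Python sketch below. It uses the first ten fixed acidity and pH values from the structure output, so the result is illustrative only. The cut-offs in `describe` (0.4 and 0.6) are assumptions chosen to match the wording of these notes; the report does not state its thresholds.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def describe(r, moderate=0.4, strong=0.6):
    """Label |r| as weak, moderate or strong (ASSUMED cut-offs)."""
    size = abs(r)
    if size >= strong:
        return "strong"
    if size >= moderate:
        return "moderate"
    return "weak"


# First ten values of each column, copied from the structure output above.
fixed_acidity = [7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5]
ph = [3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35]

r = pearson_r(fixed_acidity, ph)
print(f"r = {r:.2f} ({describe(r)})")
```

Even on this small sample the sign is negative, which matches the direction reported for pH and fixed acidity.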

Correlation Network

Notes: We can easily see the relationship between all variables with the correlation network. The blue line means a positive correlation and the red line means a negative correlation. Light color is a weak correlation and dark color means a strong correlation.

Since we are interested in understanding which attributes have a meaningful correlation with the quality of red wine, we will focus on this type of correlation. Further, we would like to interpret the attributes that correlate with quality together with the attributes they correlate with in turn. For example, quality has a moderate correlation with alcohol and volatile acidity. In addition, alcohol has a moderate correlation with density, which has a moderate correlation with sugar and a strong correlation with fixed acidity.

Quality vs Alcohol

## rw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## rw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## rw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## rw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## rw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## rw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

Notes: The trend between alcohol and quality is clear, with the highest quality score having the largest median. In other words, the amount of alcohol increases with a better quality ranking. The exception is a score of 5, whose median (9.7) is lower than that of a score of 4 (10.0); most of the outliers also have a score of 5.
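The trend can be checked directly from the medians in the grouped summaries above. The short Python sketch below lists the adjacent quality scores where the median alcohol drops instead of rising; `decreasing_steps` is a hypothetical helper.

```python
# Median alcohol per quality score, copied from the grouped summaries above.
MEDIAN_ALCOHOL = {3: 9.925, 4: 10.00, 5: 9.70, 6: 10.50, 7: 11.50, 8: 12.15}


def decreasing_steps(medians):
    """Return adjacent score pairs where the median drops instead of rising."""
    scores = sorted(medians)
    return [
        (low, high)
        for low, high in zip(scores, scores[1:])
        if medians[high] < medians[low]
    ]


exceptions = decreasing_steps(MEDIAN_ALCOHOL)
print(exceptions)  # the only drop is between scores 4 and 5
```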

Quality vs Volatile Acidity

## rw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## rw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## rw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## rw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## rw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## rw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

Notes: Volatile acidity shows the opposite trend, with the worst quality score having the largest median. In other words, the amount of volatile acidity decreases with a better quality ranking.

Density vs Alcohol scatter plot

Notes: There is a moderate negative correlation between density and alcohol. That means that as the level of alcohol decreases, the density increases. We can see this pattern with the line models.
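A line model of this kind can be sketched with ordinary least squares. The Python example below fits density against alcohol on the six rows shown in the preview earlier in this report; it is a sketch on six points, not the fit drawn in the plot, and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Alcohol and density for the first six rows, copied from the preview above.
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4]
density = [0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978]

intercept, slope = fit_line(alcohol, density)
print(f"density = {intercept:.4f} + ({slope:.5f}) * alcohol")
```

The fitted slope is negative, so on these rows density falls as alcohol rises, in the same direction as the correlation described in the note.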

Density vs Sugar scatter plot

Notes: Sugar and density have a weak positive correlation, and the scatter plot suffers from some overplotting. Most of the points are concentrated below 5 g / dm^3. However, the line model shows a slightly increasing trend: as the sugar rises, the density grows slightly.

Density vs Citric Acid scatter plot

Notes: The correlation between density and citric acid is similar to that between density and sugar. They have a weak positive correlation and the scatter plot suffers from some overplotting. Notice that the points are more dispersed along the y axis. Finally, we can see the same trend as with sugar and density.

Density vs Fixed.acidity scatter plot

Notes: There is a strong positive correlation between density and fixed acidity. That means the density increases as the level of fixed acidity increases. We can see this trend with the line models.

Volatile vs Citric Acid scatter plot

Notes: Volatile acidity and citric acid have a moderate negative correlation. We can observe volatile acidity falling as citric acid grows.

Bivariate Analysis Summary

In this part, we investigated the relationships between the variables. Since quality is an ordinal categorical variable, we analyzed it against continuous variables with box plots. Then, we examined pairs of continuous variables with scatter plots. We further applied a regression line that allows us to understand the pattern behind these correlations.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The box plots show how quality varies with alcohol and volatile acidity. In other words, quality has a moderate positive correlation with alcohol (0.5) and a moderate negative correlation with volatile acidity (-0.4). Quality does not have other moderate or strong correlations.

In addition, we saw how density varies with alcohol, sugar, citric acid and fixed acidity. Density has a moderate negative correlation with alcohol (-0.5), a strong positive correlation with fixed acidity, and weak positive correlations with residual sugar and citric acid. The pH variable has a strong negative correlation with citric acid and fixed acidity.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The acid features tend to correlate with each other. The same occurs with free sulfur dioxide and total sulfur dioxide.

What was the strongest relationship you found?

The strongest relationships were between total sulfur dioxide and free sulfur dioxide (a strong positive correlation, 0.7) and between citric acid and fixed acidity (a strong positive correlation, 0.7). Citric acid also has a negative correlation with volatile acidity (-0.6).

Multivariate Analysis

Density VS Alcohol

Notes: Most of the samples with a 7 and 8 quality score appear above 10% of alcohol and below 0.997 of density. Wines with a low score are distributed around 8% to 10% of alcohol and a density concentration of about 0.995 to 1.000.

Notes: Comparing density VS alcohol by quality, we can see a moderate negative correlation between density and alcohol by quality scores, except a score of 5. The samples with a score of 5 show a weak negative correlation.

Density vs Sugar

Notes: Based on the scatterplot, most of the samples, whatever their quality score, cluster below 4 g / dm^3 of sugar and are dispersed in density across the observed range of about 0.990 to 1.004. That means the majority of wines are not sweet, so sugar does not explain why the density attribute is spread out.

Notes: The density and sugar have a weak correlation when analyzed by quality, except for a score of 5, which has a moderate positive correlation.

Density vs Fixed acidity

Notes: Density vs Fixed acidity analyzed by quality show a strong positive linear correlation. We can see fixed acidity increases as density increases.

Notes: When we compare density vs fixed acidity with regression, the plots show a linear trend, and all scores have a strong positive correlation with the highlight being a score of 8 that has a 0.85 correlation.

Density vs Citric Acid

Notes: Density vs citric acid analyzed by quality shows a weak correlation. Most of the samples with a quality score of 7 or 8 appear above 0.25 of citric acid, while most of the samples with a low score seem to be below 0.25. Moreover, both groups seem to be distributed around 0.993 to 1.0 of density.

Notes: Comparing density vs citric acid for each quality score separately, we can see a strong positive correlation for scores of 8 (0.82) and 3 (0.78). The other scores show moderate or weak correlations.

Volatile vs Citric Acid

Notes: Most of the observations with a higher score (7 and 8) have citric acid of about 0.25 to 0.75 and volatile acidity between 0.2 and 0.6. In contrast, the low-score observations are more concentrated below 0.25 of citric acid and have volatile acidity between 0.4 and 0.8.

Notes: Volatile vs citric acid show a moderate negative correlation, except for the lowest quality score, which has a strong negative correlation.

Multivariate Analysis Summary

In this section, I investigated density vs alcohol, density vs sugar, density vs fixed acidity, density vs citric acid, and volatile acidity vs citric acid, each analyzed by quality. My focus here was to understand how these correlations change across quality scores.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There is a strong positive correlation between density and fixed acidity for all scores (up to 0.85 for a score of 8), and between density and citric acid for scores of 8 (0.82) and 3 (0.78). The other comparisons, such as density vs alcohol, had a moderate or weak correlation.

Were there any interesting or surprising interactions between features?

Since the attribute descriptions state that sugar influences density, I was surprised by the weak correlation between the two when analyzed by quality.

Building a classification model to predict quality level of wine

Since our main output variable (quality) is an ordinal categorical variable, I will build a classification model with non-linear (CART, kNN) and complex non-linear (SVM, RF) methods.

Let’s evaluate 4 different algorithms:

1 - Classification and Regression Trees (CART)

2 - k-Nearest Neighbors (kNN)

3 - Support Vector Machines (SVM) with a linear kernel

4 - Random Forest (RF)

Create a Validation Dataset

We need to know if the model we created is any good.

Our model will predict if the wine is high, medium or low quality. First, I will create a new column (quality_levels) with three levels (high, medium and low). Then, I will cut the quality column based on these three levels. Later, I will create a data partition with 80% of the original data set for training the model. Afterwards, I will use the other 20% of the original data set for testing the models.

## 'data.frame':    1599 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality_levels      : Factor w/ 3 levels "low","medium",..: 2 2 2 2 2 2 2 3 3 2 ...
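The two preparation steps described above (cutting quality into three levels, then holding out 20% of the rows) can be sketched in Python as follows. The cut points (3-4 low, 5-6 medium, 7-8 high) are an assumption that is consistent with the quality_levels codes in the structure output; the helper names, the seed and the stratified sampling scheme are hypothetical and not taken from the report.

```python
import random


def quality_level(score):
    """Map a quality score to a level. Cut points are ASSUMED (<=4, 5-6, >=7)."""
    if score <= 4:
        return "low"
    if score <= 6:
        return "medium"
    return "high"


def stratified_split(labels, train_fraction=0.8, seed=42):
    """Return (train_idx, test_idx), sampling each level separately."""
    rng = random.Random(seed)
    by_level = {}
    for index, label in enumerate(labels):
        by_level.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_level.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# First ten quality scores, copied from the structure output above.
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
levels = [quality_level(q) for q in quality]
train_idx, test_idx = stratified_split(levels)
print(levels)
print(len(train_idx), len(test_idx))
```

Because the level is computed from quality alone, the quality column (and the index X) should be left out of the predictors when a model is trained to predict the level.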

Notes: I decided to keep the outliers in the sugar attribute because these outliers are true values.

Test Harness

We will use 10-fold cross-validation to estimate accuracy.

This will split our dataset into ten parts, train on nine and test on one, and repeat this for all combinations of train-test splits. We will also repeat the process 3 times for each algorithm, with different splits of the data into ten groups, in an effort to get a more accurate estimate.
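The resampling scheme described above can be sketched in Python as index bookkeeping only (no model is trained here). The function names and the seed are hypothetical; the sample count 1599 is the size of the dataset.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for one shuffled k-fold pass."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Slicing with a stride gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


def repeated_k_fold(n_samples, k=10, repeats=3, seed=0):
    """Repeat the k-fold pass, reshuffling with a different seed each time."""
    for r in range(repeats):
        yield from k_fold_indices(n_samples, k, seed + r)


splits = list(repeated_k_fold(1599, k=10, repeats=3))
fold_sizes = sorted({len(test) for _, test in splits})
print(len(splits), fold_sizes)
```

Within each pass every row is used for testing exactly once, and three passes give 30 accuracy estimates per algorithm.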

Select Best Model

We can report on the accuracy of each model by first creating a list of the created models and using the summary function.

## 
## Call:
## summary.resamples(object = results)
## 
## Models: cart, knn, svm, rf 
## Number of resamples: 10 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## cart 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000    0
## knn  0.8113208 0.8167443 0.8270440 0.8275826 0.8372445 0.8481013    0
## svm  0.9685535 0.9827044 0.9937303 0.9899450 0.9984375 1.0000000    0
## rf   1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000    0
## 
## Kappa 
##             Min.    1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## cart  1.00000000 1.00000000 1.0000000 1.0000000 1.0000000 1.0000000    0
## knn  -0.01112878 0.06706928 0.1275189 0.1201387 0.1634461 0.2238806    0
## svm   0.88246600 0.93791877 0.9783921 0.9636877 0.9946879 1.0000000    0
## rf    1.00000000 1.00000000 1.0000000 1.0000000 1.0000000 1.0000000    0

Notes: CART and RF models achieve 100% accuracy. Now, I will summarize both models to figure out which one is the best.

Summarize the Best Model

## CART 
## 
## 1599 samples
##   13 predictor
##    3 classes: 'low', 'medium', 'high' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1430, 1430, 1430, 1429, 1430, 1430, ... 
## Resampling results across tuning parameters:
## 
##   cp         Accuracy   Kappa    
##   0.0000000  1.0000000  1.0000000
##   0.1962963  0.9887028  0.9572826
##   0.8037037  0.8702529  0.2572826
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.
## Random Forest 
## 
## 1599 samples
##   13 predictor
##    3 classes: 'low', 'medium', 'high' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1431, 1430, 1430, 1430, 1430, 1430, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9918278  0.9710174
##    7    1.0000000  1.0000000
##   13    1.0000000  1.0000000
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7.

Notes: Both models reach 100% cross-validated accuracy at their best tuning values (cp = 0 for CART, mtry = 7 for RF). I will continue with the random forest model.

Make Predictions

The RF model reached the highest cross-validated accuracy (tied with CART). Now we want to get an idea of the accuracy of the model on our validation set (20% of our original dataset). This will give us an independent final check on the accuracy of the best model. We will run the RF model directly on the validation set and summarize the results in a confusion matrix.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction low medium high
##     low     10      0    0
##     medium   0    263    0
##     high     0      0   43
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9884, 1)
##     No Information Rate : 0.8323     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: low Class: medium Class: high
## Sensitivity             1.00000        1.0000      1.0000
## Specificity             1.00000        1.0000      1.0000
## Pos Pred Value          1.00000        1.0000      1.0000
## Neg Pred Value          1.00000        1.0000      1.0000
## Prevalence              0.03165        0.8323      0.1361
## Detection Rate          0.03165        0.8323      0.1361
## Detection Prevalence    0.03165        0.8323      0.1361
## Balanced Accuracy       1.00000        1.0000      1.0000
## [1] 318  14

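Notes: Our model gets 100% accuracy on the validation set. This result should be read with caution. The validation set contains only a small part (20%, or 318 rows) of our original dataset. In addition, the model summaries above report 1599 samples and 13 predictors out of 14 columns, which suggests that the models were fitted on the full dataset and that the original quality column, from which quality_levels is derived, was used as a predictor. That alone would be enough to produce perfect accuracy. We need to test this model on other wine data sets to evaluate whether it is reliably accurate.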
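The accuracy, no-information rate and kappa reported above can be recomputed from the confusion matrix with the Python sketch below. The `always_medium` matrix is a hypothetical baseline added for comparison; it is not a model from this report.

```python
def accuracy(matrix):
    """Share of cases on the diagonal (prediction equals reference)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def no_information_rate(matrix):
    """Accuracy of always predicting the most common reference class."""
    total = sum(sum(row) for row in matrix)
    column_totals = [sum(row[j] for row in matrix) for j in range(len(matrix))]
    return max(column_totals) / total


def cohen_kappa(matrix):
    """Agreement beyond chance: (observed - expected) / (1 - expected)."""
    total = sum(sum(row) for row in matrix)
    observed = accuracy(matrix)
    row_totals = [sum(row) for row in matrix]
    column_totals = [sum(row[j] for row in matrix) for j in range(len(matrix))]
    expected = sum(r * c for r, c in zip(row_totals, column_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Rows are predictions, columns are reference classes (low, medium, high),
# copied from the confusion matrix above.
reported = [[10, 0, 0], [0, 263, 0], [0, 0, 43]]

# Hypothetical baseline for comparison: always predict "medium".
always_medium = [[0, 0, 0], [10, 263, 43], [0, 0, 0]]

for name, matrix in [("reported", reported), ("always medium", always_medium)]:
    print(name, round(accuracy(matrix), 4), round(cohen_kappa(matrix), 4))
```

The baseline shows why accuracy alone can mislead on this dataset: always answering "medium" already scores about 0.83 with a kappa of 0, which is close to the kNN row in the resampling summary (accuracy about 0.83, kappa about 0.12).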

Final Plots and Summary

Plot One

Description One

Most wine samples have a score of 5 or 6 (almost 80% of the dataset). Only a few wines received the highest score (8), and the same holds for the lowest scores (3, 4). About 200 wines have a score of 7.

Plot two

## rw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## rw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## rw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## rw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## rw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## rw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

Description Two

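The trend between alcohol and quality is clearer here: the highest quality score has the largest median alcohol content, and the median generally rises with the quality score. The exception is a score of 5, whose median (9.7) is lower than those of scores 3 and 4. Most of the outliers also have a score of 5, but since the median is robust to outliers, they do not by themselves explain the lower median; the bulk of the score-5 wines simply has a lower alcohol content (first quartile 9.4, third quartile 10.2).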
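The pattern in the summary output above can be checked with a few lines of base R. This sketch copies the six medians from that output into a small data frame (the object names are illustrative) and looks at where the median drops from one score to the next. The rank correlation here is computed on the six group medians only, not on the individual wines.

```r
# Median alcohol content per quality score, copied from the summary output above.
alcohol_by_quality <- data.frame(
  quality = 3:8,
  median  = c(9.925, 10.00, 9.7, 10.50, 11.50, 12.15)
)

# Change in median from one quality score to the next.
step_change <- diff(alcohol_by_quality$median)

# Quality scores whose median is lower than that of the previous score.
dips <- alcohol_by_quality$quality[-1][step_change < 0]

# Spearman rank correlation between quality score and median alcohol.
rank_cor <- cor(alcohol_by_quality$quality, alcohol_by_quality$median,
                method = "spearman")

dips
rank_cor
```

The only dip is at a score of 5; every other step between adjacent scores increases the median.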

Plot Three

Description Three

Most of the samples with quality scores of 7 and 8 lie above 10% alcohol and below a density of 0.997. Wines with a low score are concentrated between 8% and 10% alcohol, with densities of about 0.995 to 1.000.

Comparing density and alcohol by quality score, we see a moderate negative correlation within each score except 5, where the negative correlation is weak.
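A per-score correlation like the one described above can be computed by splitting the data on quality and applying `cor` to each group. The sketch below runs on toy data, not on the wine measurements: the data frame `toy`, its size and the assumed linear relationship are all hypothetical and only serve to make the example self-contained.

```r
# Toy data only: NOT the wine measurements. It mimics the structure of `rw`
# with three columns so the per-group calculation can run on its own.
set.seed(42)
n <- 300
toy <- data.frame(
  quality = sample(5:7, n, replace = TRUE),
  alcohol = runif(n, min = 8.5, max = 14)
)
# Density falls as alcohol rises, plus noise (an assumed toy relationship).
toy$density <- 1.002 - 0.0006 * toy$alcohol + rnorm(n, sd = 0.001)

# Pearson correlation between alcohol and density within each quality score.
cor_by_quality <- sapply(
  split(toy, toy$quality),
  function(group) cor(group$alcohol, group$density)
)

round(cor_by_quality, 2)
```

On the real data the same call would be applied to `rw` instead of `toy`, assuming its columns are named `alcohol` and `density`.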

Reflection

Where did I run into difficulties in the analysis?

In this project, the main challenge was to carry out a good EDA with a small data set. For example, I was really interested in exploring wines with the highest quality level (8), but the observations at this level make up only about 1% of all samples. In my opinion, that was insufficient to obtain a reliable result.

Furthermore, I had trouble applying machine learning methods. Since the quality variable is categorical, I assumed a classification method would be the best fit, but this project is about finding relationships rather than making predictions.

Where did I find successes?

I obtained good results when investigating the variables individually and the relationships among them. I was surprised by how some variables relate positively or negatively to the quality of the wine.

How could the analysis be enriched in future work (e.g. additional data and analyses)?

I believe that including variables such as grape type, brand, price and country, along with more data, would make the analysis considerably richer.

Conclusions

The red wine data set, from 2009, contains information on almost 1,600 red wine samples, each described by 11 physicochemical properties and a quality rating. First, I used descriptive statistics along with histograms, log10 histograms, density plots and box plots for each variable separately. About 82% of the dataset received an average score (5 or 6), while the highest score (8) accounts for only about 1% (18 rows) of the observations. Moreover, I analyzed the mean and median for the highest, average and lowest quality scores. The mean was not entirely reliable for a few attributes, such as sugar and chlorides: their means differed markedly from their medians, which is typical when outliers are present. So, I preferred to use only the median to capture the main characteristics of each quality level. The wine with the highest score has the following features:

High level of: alcohol, fixed acidity, citric acid, sulphates

Medium level of: sulfur dioxides

Low level of: density, volatile acidity, pH, and sugar
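The preference for the median over the mean described above can be illustrated with a small example. The values below are hypothetical residual sugar readings, not taken from the dataset; they only show how a single outlier pulls the mean away from the median.

```r
# Hypothetical residual sugar values (g/L): five typical wines and one outlier.
residual_sugar <- c(1.9, 2.0, 2.1, 2.2, 2.3, 15.5)

mean_sugar   <- mean(residual_sugar)    # pulled upward by the outlier
median_sugar <- median(residual_sugar)  # stays with the bulk of the data

# Same statistics with the outlier removed, for comparison.
typical <- residual_sugar[residual_sugar < 10]
mean_no_outlier   <- mean(typical)
median_no_outlier <- median(typical)

c(mean = mean_sugar, median = median_sugar)
c(mean = mean_no_outlier, median = median_no_outlier)
```

Removing the outlier roughly halves the mean here, while the median barely moves, which is why the median is the safer summary for skewed attributes.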

Later, we investigated the correlation between all variables through the correlation matrix, correlation network, box plots, scatter plots and scatter plots with fitted regression lines. The quality variable had a meaningful correlation with alcohol and volatile acidity. We noted a positive trend between quality and alcohol: the level of alcohol increases with a better quality ranking. The opposite occurs with volatile acidity, whose level decreases with a better quality ranking. This makes sense, because high levels of volatile acidity lead to an unpleasant, vinegar taste. Additionally, we were surprised by the residual sugar variable: before the analysis we thought that sugar could be one of the most important attributes, but it showed almost no correlation with quality. Besides that, we saw how alcohol and volatile acidity (the attributes correlated with quality) relate to other variables, such as density.

In the multivariate analysis, we examined alcohol vs density, volatile acidity vs citric acid and alcohol vs volatile acidity, grouped by quality, using scatter plots and regression lines. We found a strong positive correlation between density and fixed acidity and between density and citric acid. In addition, we found a moderate to weak negative correlation between density and alcohol.

Finally, we built four classification models to predict whether a wine is of high, medium or low quality. The best model used the random forest method and achieved 100% accuracy on the validation data set. As the validation data set contains only 318 observations, we recommend testing this model on other wine data sets to evaluate whether it is reliable.
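As a rough illustration of how the three classes could be derived and kept separate from the predictors, the sketch below bins quality scores with base R's `cut`. The bin edges (3-4 low, 5-6 medium, 7-8 high), the column name `rating` and the toy data frame are assumptions made for this example, not necessarily what the models above used.

```r
# Toy quality scores covering the range seen in the data (3 to 8).
toy <- data.frame(
  quality = c(3, 4, 5, 5, 6, 6, 7, 8),
  alcohol = c(9.9, 10.0, 9.7, 9.4, 10.5, 11.3, 11.5, 12.2)
)

# Assumed bins: 3-4 = low, 5-6 = medium, 7-8 = high.
toy$rating <- cut(
  toy$quality,
  breaks = c(2, 4, 6, 8),
  labels = c("low", "medium", "high")
)

# Drop the raw score before modelling so it cannot reveal the class label.
predictors <- toy[, setdiff(names(toy), c("quality", "rating")), drop = FALSE]

table(toy$rating)
names(predictors)
```

Keeping `quality` out of the predictors matters when the class is derived from it, because a model given both would only need to relearn the bin edges.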
